22 Criterion Validity

22.1 Introduction

Criterion validity is an important concept, especially in psychometrics. It refers to the extent to which scores on a test agree with an external criterion or standard.

Essentially, it’s about how well scores on one measure predict, or correlate with, an outcome on another measure.

In educational research, criterion validity helps determine how effectively a test (such as the SAT in the USA) predicts a student’s future academic success. It helps show that an assessment is not only theoretically ‘sound’ but also practically relevant.

22.2 Types of criterion validity

Criterion validity is primarily divided into two types: predictive and concurrent validity.

‘Predictive’ validity is the extent to which test scores predict future outcomes. For example, in the USA, the predictive validity of the GRE would be assessed by how well GRE scores forecast students’ success in graduate school.
‘Concurrent’ validity refers to how well a test agrees with an established measure administered at about the same time. An example would be comparing the results of a new depression inventory with an established clinical assessment for depression.

22.3 Measuring criterion validity

The process of measuring criterion validity begins with clearly identifying the criterion. This should be a standard or outcome that is already accepted and validated.

Once the criterion is identified, the next step is to collect test scores and criterion data. This involves administering the test and then comparing its results with criterion data gathered at about the same time (for concurrent validity) or at a later date (for predictive validity).
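
Before any comparison, each person's test score has to be matched with that same person's criterion value. The Python sketch below shows one way to do this. It assumes the data are stored as dictionaries keyed by a participant ID. The function name pair_scores and all of the scores are hypothetical, invented for illustration. In a predictive design some participants may have no criterion data by the time it is collected, and this sketch simply drops them.

```python
def pair_scores(test_scores: dict[str, float],
                criterion: dict[str, float]) -> tuple[list[float], list[float]]:
    """Align test scores with criterion data by participant ID (hypothetical helper).

    Participants missing either measurement are dropped, so the two
    returned lists have the same length and the same participant order.
    """
    shared_ids = sorted(test_scores.keys() & criterion.keys())
    xs = [test_scores[pid] for pid in shared_ids]
    ys = [criterion[pid] for pid in shared_ids]
    return xs, ys


# Hypothetical data: entrance-test scores collected now, and first-year
# grade point averages (the criterion) collected a year later, as in a
# predictive design.
test_scores = {"s01": 520, "s02": 610, "s03": 480, "s04": 700}
first_year_gpa = {"s01": 2.9, "s02": 3.4, "s04": 3.8}  # s03 has no criterion data

x, y = pair_scores(test_scores, first_year_gpa)
print(x)  # [520, 610, 700]
print(y)  # [2.9, 3.4, 3.8]
```

Dropping unmatched participants is the simplest choice. If many people are lost before the criterion is measured, the remaining sample may no longer represent everyone who took the test.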

The strength of the relationship between the test scores and the criterion data is then analysed, typically by computing a correlation coefficient, which in this context is often called the validity coefficient.
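
As a minimal sketch, the Python code below computes Pearson's r, which is one common choice of correlation, for a concurrent design. It assumes the scores are already paired person by person. The depression-inventory scores and clinician ratings are invented for illustration, and pearson_r is a hypothetical helper rather than part of any library.

```python
import math


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson correlation between test scores x and criterion values y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squared deviations from the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical paired data: scores on a new depression inventory and
# ratings from an established clinical assessment of the same people.
new_inventory = [12, 25, 8, 30, 18, 22]
clinical_rating = [3, 6, 2, 7, 4, 6]

r = pearson_r(new_inventory, clinical_rating)
print(f"validity coefficient r = {r:.2f}")
```

On Python 3.10 or later, statistics.correlation(x, y) from the standard library returns the same Pearson coefficient. The hand-written version is shown here so that each step of the calculation is visible.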

22.4 Challenges to criterion validity

One major challenge is criterion contamination, where the criterion itself may be influenced by factors other than those the test aims to measure.
Another challenge is criterion deficiency, where the criterion does not fully represent the construct the test is designed to measure.
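
The effect of criterion contamination can be illustrated with a small simulation. The model below is entirely hypothetical. A latent ability drives both the test score and true job performance. The contaminated criterion is a supervisor rating that also reflects an extraneous factor, such as rater bias, which is unrelated to the test. The weights (0.5 noise, 1.5 for the extraneous factor), the sample size and the seed are arbitrary assumptions.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed so the simulation is repeatable
n = 5000

# Hypothetical model: latent ability drives the test and true performance.
ability = [rng.gauss(0, 1) for _ in range(n)]
test = [a + rng.gauss(0, 0.5) for a in ability]
performance = [a + rng.gauss(0, 0.5) for a in ability]

# Contaminated criterion: ratings that also reflect an extraneous factor
# (e.g. rater bias) that has nothing to do with what the test measures.
extraneous = [rng.gauss(0, 1) for _ in range(n)]
contaminated = [p + 1.5 * e for p, e in zip(performance, extraneous)]

clean_r = pearson_r(test, performance)
contaminated_r = pearson_r(test, contaminated)
print(f"r with uncontaminated criterion: {clean_r:.2f}")
print(f"r with contaminated criterion:   {contaminated_r:.2f}")
```

Under this model the population values are 0.80 and about 0.48, so the simulated coefficients should land near those figures. In this case the extraneous factor is unrelated to the test, so contamination lowers the coefficient. If the contaminating factor were related to test scores, for example raters who know the test results, the contamination could inflate the coefficient instead.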